Understanding the shape of a distribution of data is of interest to people in a great variety of fields, as it may affect the types of algorithms used for that data. We study one such problem in the framework of distribution property testing, characterizing the number of samples required to distinguish whether a distribution has a certain property or is far from having that property. In particular, given samples from a distribution, we seek to characterize the tail of the distribution, that is, understand how many elements appear infrequently. We develop an algorithm based on a careful bucketing scheme that distinguishes light-tailed distributions from non-light-tailed ones with respect to a definition based on the hazard rate, under natural smoothness and ordering assumptions. We bound the number of samples required for this test to succeed with high probability in terms of the parameters of the problem, showing that it is polynomial in these parameters. Further, we prove a hardness result that implies that this problem cannot be solved without any assumptions.
translated by Google Translate
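For readers unfamiliar with the term, the hazard rate referred to in the abstract above is commonly defined as follows for a discrete distribution $p$ over ordered elements $1, 2, \dots$. This is the textbook definition, stated here as an assumption; the abstract does not give the exact light-tailed criterion used in the paper.

$$h(i) = \frac{p_i}{\sum_{j \ge i} p_j}$$

As a worked case, a geometric distribution with $p_i = (1-q)\,q^{i-1}$ has tail mass $\sum_{j \ge i} p_j = q^{i-1}$, so $h(i) = 1 - q$ is constant in $i$.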
Many state-of-the-art deep learning models for computer vision tasks are based on the transformer architecture. Such models can be computationally expensive and are typically statically set to meet the deployment scenario. However, in real-time applications, the resources available for every inference can vary considerably and be smaller than what state-of-the-art models use. We can use dynamic models to adapt the model execution to meet real-time application resource constraints. While prior dynamic work has primarily minimized resource utilization for less complex input images while maintaining accuracy and focused on CNNs and early transformer models such as BERT, we adapt vision transformers to meet system dynamic resource constraints, independent of the input image. We find that unlike early transformer models, recent state-of-the-art vision transformers heavily rely on convolution layers. We show that pretrained models are fairly resilient to skipping computation in the convolution and self-attention layers, enabling us to create a low-overhead system for dynamic real-time inference without additional training. Finally, we create an optimized accelerator for these dynamic vision transformers in a 5nm technology. The PE array occupies 2.26mm$^2$ and is 17 times faster than an NVIDIA TITAN V GPU for state-of-the-art transformer-based models for semantic segmentation.
Multi-agent artificial intelligence research promises a path to develop intelligent technologies that are more human-like and more human-compatible than those produced by "solipsistic" approaches, which do not consider interactions between agents. Melting Pot is a research tool developed to facilitate work on multi-agent artificial intelligence, and provides an evaluation protocol that measures generalization to novel social partners in a set of canonical test scenarios. Each scenario pairs a physical environment (a "substrate") with a reference set of co-players (a "background population"), to create a social situation with substantial interdependence between the individuals involved. For instance, some scenarios were inspired by institutional-economics-based accounts of natural resource management and public-good-provision dilemmas. Others were inspired by considerations from evolutionary biology, game theory, and artificial life. Melting Pot aims to cover a maximally diverse set of interdependencies and incentives. It includes the commonly studied extreme cases of perfectly competitive (zero-sum) motivations and perfectly cooperative (shared-reward) motivations, but does not stop with them. As in real life, a clear majority of scenarios in Melting Pot have mixed incentives. They are neither purely competitive nor purely cooperative and thus demand that successful agents be able to navigate the resulting ambiguity. Here we describe Melting Pot 2.0, which revises and expands on Melting Pot. We also introduce support for scenarios with asymmetric roles, and explain how to integrate them into the evaluation protocol. This report also contains: (1) details of all substrates and scenarios; (2) a complete description of all baseline algorithms and results. Our intention is for it to serve as a reference for researchers using Melting Pot 2.0.
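Annotating bounding boxes for object detection is expensive, time-consuming, and error-prone. In this work, we propose a DETR-based framework designed to explicitly complete missing annotations in partially annotated dense scene datasets. This reduces the need to annotate every object instance in a scene, thereby reducing annotation cost. The framework augments the object queries in the DETR decoder with patch information of objects in the image. Combined with a matching loss, it can effectively find objects that are similar to the input patch and complete the missing annotations. We show that our framework outperforms state-of-the-art methods such as Soft Sampling and Unbiased Teacher, while it can also be used in conjunction with these methods to further improve their performance. Our framework is also agnostic to the choice of downstream object detector. We show performance improvements for several popular detectors such as Faster R-CNN, Cascade R-CNN, CenterNet2, and Deformable DETR on multiple dense scene datasets.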
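Over the last few decades, many aspects of human life have been enhanced with virtual domains, from the advent of digital assistants such as Amazon's Alexa and Apple's Siri to the latest metaverse efforts of the rebranded Meta. These trends underscore the importance of generating photorealistic visual depictions of humans. In recent years, this has led to the rapid growth of so-called deepfake and talking-head generation methods. Despite their impressive results and popularity, they usually lack certain qualitative aspects such as texture quality, lip synchronization, or resolution, as well as practical aspects such as the ability to run in real time. To allow virtual human avatars to be used in practical scenarios, we propose an end-to-end framework for synthesizing high-quality virtual human faces capable of speech, with a special emphasis on performance. We introduce a novel network that uses visemes as an intermediate audio representation, and a novel data augmentation strategy employing a hierarchical image synthesis approach that allows disentanglement of the different modalities used to control the global head motion. Our method runs in real time and delivers superior results compared to the current state of the art.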
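We study semi-supervised learning (SSL) for vision transformers (ViT), an under-explored topic despite the wide adoption of ViT architectures for different tasks. To tackle this problem, we propose a new SSL pipeline consisting of un/self-supervised pre-training first, followed by supervised fine-tuning, and finally semi-supervised fine-tuning. At the semi-supervised fine-tuning stage, we adopt an exponential moving average (EMA)-Teacher framework instead of the popular FixMatch, since the former is more stable and delivers higher accuracy for semi-supervised vision transformers. In addition, we propose a probabilistic pseudo mixup mechanism to interpolate unlabeled samples and their pseudo labels for improved regularization, which is important for training ViTs with weak inductive bias. Our proposed method, dubbed Semi-ViT, achieves comparable or better performance than its CNN counterparts in the semi-supervised classification setting. Semi-ViT also enjoys the scalability benefits of ViTs and can be readily scaled up to large models with increasing accuracy. For example, Semi-ViT-Huge achieves an impressive 80% top-1 accuracy on ImageNet using only 1% of the labels, which is comparable to Inception-v4 using 100% of the ImageNet labels.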
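The two ingredients named in the abstract above, an EMA teacher and mixup on pseudo-labeled samples, can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration: the function names `ema_update` and `mixup_pseudo`, the momentum and `alpha` values, and the use of plain mixup. The paper's probabilistic variant is not specified in the abstract and is not reproduced here.

```python
import numpy as np


def ema_update(teacher, student, momentum=0.999):
    """Move each teacher weight toward the matching student weight.

    teacher, student: dicts mapping parameter names to NumPy arrays
    (a hypothetical stand-in for real model parameters).
    """
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }


def mixup_pseudo(x, pseudo_probs, rng, alpha=0.8):
    """Plain mixup of unlabeled samples and their pseudo-label distributions.

    x: array of shape (batch, ...) holding unlabeled inputs.
    pseudo_probs: array of shape (batch, classes) predicted by the teacher.
    Returns the mixed inputs, mixed soft labels, and the mixing weight.
    """
    lam = rng.beta(alpha, alpha)        # mixing weight in [0, 1]
    perm = rng.permutation(len(x))      # random partner for every sample
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * pseudo_probs + (1.0 - lam) * pseudo_probs[perm]
    return x_mix, y_mix, lam
```

In a training loop of this kind, the student would be trained on the mixed batch and `ema_update` would be called after each student optimizer step, so the teacher that produces the pseudo labels changes more slowly than the student.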
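In this paper, we study how to use masked signal modeling in vision and language (V+L) representation learning. Instead of developing masked language modeling (MLM) and masked image modeling (MIM) independently, we propose to build joint masked vision and language modeling, where the masked signal of one modality is reconstructed with the help of the other modality. This is motivated by the nature of image-text paired data, in which the image and the text convey almost the same information but in different formats. Reconstructing the masked signal of one modality conditioned on the other modality can also implicitly learn cross-modal alignment between language tokens and image patches. Our experiments on various V+L tasks show that the proposed method not only achieves state-of-the-art performance when using a large amount of data, but also outperforms other competitors in regimes of limited training data.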
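Most existing works on few-shot object detection (FSOD) focus on a setting where the pre-training and few-shot learning datasets come from a similar domain. However, few-shot algorithms are important in multiple domains, so evaluation needs to reflect this broad range of applications. We propose a Multi-dOmain Few-Shot Object Detection (MoFSOD) benchmark, consisting of 10 datasets from a wide range of domains, to evaluate FSOD algorithms. We comprehensively analyze the impact of freezing layers, different architectures, and different pre-training datasets on FSOD performance. Our empirical results reveal several key factors that have not been explored in previous works: 1) contrary to previous belief, on a multi-domain benchmark, fine-tuning (FT) is a strong baseline for FSOD, performing on par with or better than state-of-the-art (SOTA) algorithms; 2) using FT as the baseline allows us to explore multiple architectures, and we find that they have a significant impact on downstream few-shot tasks, even with similar pre-training performance; 3) by decoupling pre-training and few-shot learning, MoFSOD allows us to explore the impact of different pre-training datasets, and the right choice can significantly boost the performance of downstream tasks. Based on these findings, we list possible avenues of investigation for improving FSOD performance and make two simple modifications to existing algorithms that lead to SOTA performance on the MoFSOD benchmark. The code is available at https://github.com/amazon-research/few-shot-object-detection-benchmark.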
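Speaker identification (SID) in the household scenario (e.g., for smart speakers) is an important but challenging problem due to the limited number of labeled (enrollment) utterances, confusable voices, and demographic imbalances. Conventional speaker recognition systems generalize from a large random sample of speakers, causing recognition to underperform for households drawn from specific cohorts or otherwise exhibiting high confusability. In this work, we propose a graph-based semi-supervised learning approach to improve household-level SID accuracy and robustness through locally adapted graph normalization and multi-signal fusion with multi-view graphs. Unlike other work on household SID, fairness, and signal fusion, this work focuses on speaker label inference (scoring) and provides a simple solution that achieves household-specific adaptation and multi-signal fusion without tuning the embeddings or training a fusion network. Experiments on the VoxCeleb dataset show that our approach consistently improves performance across households with different customer cohorts and degrees of confusability.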
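Associative memory has been a prominent candidate for the computation performed by the massively recurrent neocortical networks. Attractor networks implementing associative memory have offered mechanistic explanations for many cognitive phenomena. However, attractor memory models are typically trained using orthogonal or random patterns to avoid interference between memories, which makes them infeasible for naturally occurring complex correlated stimuli such as images. We approach this problem by combining a recurrent attractor network with a feedforward network that learns distributed representations using an unsupervised Hebbian-Bayesian learning rule. The resulting network model incorporates many known biological properties: unsupervised learning, Hebbian plasticity, sparse distributed activations, sparse connectivity, columnar and laminar cortical architecture, etc. We evaluate the synergistic effects of the feedforward and recurrent network components in complex pattern recognition tasks on the MNIST handwritten digits dataset. We demonstrate that the recurrent attractor component implements associative memory when trained on the feedforward-driven internal (hidden) representations. The associative memory is also shown to perform prototype extraction from the training data and to make the representations robust to severely distorted input. We argue that the proposed integration of feedforward and recurrent computations is particularly attractive from a machine learning perspective.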